Artificial Neural Networks

Rob Williams

November 14, 2018

Artificial neural networks can be used for both regression and classification tasks. Today we're going to be working with the MNIST data of handwritten digits.1

Our first task is to load our data and get it into the correct format for our network. Load numpy for array multiplication.

In [1]:
%matplotlib inline
import numpy as np

Next we need to actually download our data and preprocess it for our neural network. Load the fetch_openml() function from sklearn.datasets and use it to download the data.

In [2]:
from sklearn.datasets import fetch_openml

# download MNIST from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

# normalize the digits; maximum value is 255
X = X/255

# print dimensionality of our data
print(X.shape)
print(y.shape)

# number of output classes, K
digits = 10
(70000, 784)
(70000,)

Getting our data into the right format for our neural network is not a trivial task. Since we're predicting 10 digits, our network will have 10 output nodes. We need to convert our $y \in \{0,\ldots,9\}$ categorical vector into a matrix where each row is a 1-of-$K$ vector. In the machine learning literature, this is referred to as a one-hot encoding of the data. Use the OneHotEncoder() function in sklearn.preprocessing to create an encoder with sparse = False and transform our data.

In [3]:
from sklearn.preprocessing import OneHotEncoder

# convert to one hot array
encoder = OneHotEncoder(sparse = False, categories='auto')
y = encoder.fit_transform(y.reshape(-1, 1))
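As a quick check that the encoding worked (the specific digit shown depends on which observation happens to come first in the download, so treat the commented output as illustrative):

print(y.shape)  # (70000, 10): one row per observation, one column per digit
print(y[0])     # a single 1 in the column for the first observation's digit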

With that taken care of, we need to split our data into training and test sets, so use train_test_split() from sklearn.model_selection to hold out a test set of 60,000 observations, leaving 10,000 for training.

In [4]:
from sklearn.model_selection import train_test_split

# create 60,000 observation test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = int(6e4))

# transpose X objects for matrix multiplication later
X_train, X_test = X_train.T, X_test.T

Let's make sure our data survived this process intact by plotting the first observation in X_train using matplotlib.pyplot with a binary color map from matplotlib.cm.

In [5]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# plot to inspect data
plt.imshow(X_train[:, 0].reshape(28,28), cmap = cm.binary)
Out[5]:
<matplotlib.image.AxesImage at 0x1714da780>

In a neural network each unit calculates the weighted sum of its inputs $$a_j = \sum_i w_{ji}z_i$$ which is then passed through a nonlinear activation function $h(\cdot)$ so that $$z_j = h(a_j)$$ We're going to build a two-layer network, so the full equation is $$y_k(\mathbf{x},\mathbf{w}) = \sigma\left( \sum_{j=1}^M w_{kj} h\left( \sum_{i=1}^D w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$$ where $\sigma(\cdot)$ is the softmax function, $h(\cdot)$ is the sigmoid function, and $k \in \{1,\ldots,10\}$ for the ten possible digits. Load the expit() function from scipy.special to get access to the sigmoid function.

In [6]:
from scipy.special import expit as sigmoid
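As a quick sanity check of the activation function (toy inputs chosen arbitrarily):

print(sigmoid(0))                        # 0.5
print(sigmoid(np.array([-5., 0., 5.])))  # ≈ [0.0067, 0.5, 0.9933]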

The output function in our network is the softmax function, which takes a vector of real values and converts it to a vector of probabilities that sums to one $$f(x_k) = \frac{e^{x_k}}{\sum_{k'=1}^{K} e^{x_{k'}}}$$ Unfortunately, numpy doesn't have a built-in version, so define one below.

In [7]:
def softmax(x):
    return(np.exp(x)/np.exp(x).sum())
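One refinement worth knowing about (not needed for the rest of this notebook) is the numerically stable version, which subtracts the maximum before exponentiating; the constant factor cancels in the ratio, so the output is mathematically unchanged, but large inputs no longer overflow:

def softmax_stable(x):
    # exp(x - max(x)) avoids overflow; dividing by the sum renormalizes
    e = np.exp(x - x.max())
    return(e/e.sum())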

We also need a way to measure the error of our network at each epoch. The most common loss function for multiclass classification is the categorical cross entropy function

$$E(\mathbf{w}) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w})$$

Fill in the code below for the function loss() to calculate categorical cross entropy. Note that np.multiply() computes the elementwise product inside the double sum.

In [8]:
def loss(true, predicted):
    L_sum = np.sum(np.multiply(true, np.log(predicted)))
    L = -1 * (L_sum/true.shape[0])
    return(L)
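As a toy check that loss() behaves sensibly (the values below are made up for illustration):

true = np.eye(3)                             # three observations, one per class
confident = np.where(true == 1, 0.98, 0.01)  # rows sum to one
print(loss(true, confident))                 # -ln(0.98) ≈ 0.02
uniform = np.full((3, 3), 1/3)               # maximally uncertain predictions
print(loss(true, uniform))                   # ln(3) ≈ 1.10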

To calculate the derivatives of each of the weights, and to evaluate the gradient of the error, we need to calculate the errors at each layer of the network.

The error at the output is given by $$ \delta_k = y_k - t_k $$

while the error at the hidden layer is given by

$$ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k $$

This lets us find the derivative of the input weights

$$ \frac{\partial E_n}{\partial w_{ji}} = \delta_jx_i $$

and the derivative of the output weights is

$$ \frac{\partial E_n}{\partial w_{kj}} = \delta_kz_j $$

keeping in mind that because we're doing batch gradient descent, each matrix operation needs to be divided by the number of observations in our data.

Finally, we update all of our weights with

$$ \mathbf{w}^{\tau + 1} = \mathbf{w}^{\tau} -\eta \frac{\partial E_n}{\partial w} $$
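To make these formulas concrete before we assemble the full network, here's a tiny worked example on a single observation with made-up sizes (2 inputs, 3 hidden units, 2 outputs); it reuses sigmoid and softmax from above and the fact that the sigmoid's derivative is $h'(a) = h(a)(1 - h(a))$:

# toy network: D = 2 inputs, M = 3 hidden units, K = 2 outputs
x = np.array([[0.5], [-1.0]])  # single observation, D x 1
t = np.array([[1.0], [0.0]])   # one hot target, K x 1
W1 = np.random.randn(3, 2); b1 = np.zeros((3, 1))
W2 = np.random.randn(2, 3); b2 = np.zeros((2, 1))

# forward pass
a1 = np.matmul(W1, x) + b1
z1 = sigmoid(a1)
a2 = np.matmul(W2, z1) + b2
y_hat = softmax(a2)

# backward pass using the formulas above
delta_k = y_hat - t                                 # output error
delta_j = z1 * (1 - z1) * np.matmul(W2.T, delta_k)  # hidden error
dW2 = np.matmul(delta_k, z1.T)                      # dE/dW2
dW1 = np.matmul(delta_j, x.T)                       # dE/dW1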

Now we have all the tools we need to create a neural network. Fill in the code below to complete our network function.

In [9]:
def network(X, y, epochs, eta):
    
    # initialize weights and biases
    W1 = np.random.randn(64, X.shape[0])
    b1 = np.zeros((64, 1))
    W2 = np.random.randn(digits, 64)
    b2 = np.zeros((digits, 1))
    
    # iterate through epochs
    for e in range(epochs):
        
        # calculate feed forward predictions
        A1 = np.matmul(W1, X) + b1
        Z1 = sigmoid(A1)
        A2 = np.matmul(W2, Z1) + b2
        Z2 = np.apply_along_axis(softmax, 0, A2)
        
        # calculate derivatives for second weight layer
        dA2 = Z2 - y.T
        dW2 = np.matmul(dA2, Z1.T) / X.shape[1]
        db2 = np.sum(dA2, axis=1, keepdims=True) / X.shape[1]
        
        # calculate derivatives for first weight layer
        dZ1 = np.matmul(W2.T, dA2)
        dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
        dW1 = np.matmul(dA1, X.T) / X.shape[1]
        db1 = np.sum(dA1, axis=1, keepdims=True) / X.shape[1]
        
        # gradient update to weights
        W2 -= eta * dW2
        b2 -= eta * db2
        W1 -= eta * dW1
        b1 -= eta * db1
        
        # print loss every 10th iteration
        if (e % 10 == 0): print('Epoch', e, 'loss:', loss(y, Z2.T))
    
    # return weights in dictionary for prediction
    return({'W1':W1, 'b1':b1, 'W2':W2, 'b2':b2})

Now it's time to train our neural network. Run the network for 100 epochs with $\eta=.1$ and watch the loss decrease! Be sure to assign the result to an object so we can hold onto those weights.

In [10]:
nn = network(X_train, y_train, epochs=100, eta=.1)
Epoch 0 loss: 10.211565678000099
Epoch 10 loss: 5.714484433132308
Epoch 20 loss: 4.883599631809677
Epoch 30 loss: 4.2433035056403385
Epoch 40 loss: 3.7126016230607974
Epoch 50 loss: 3.355591540252031
Epoch 60 loss: 3.1120338123883253
Epoch 70 loss: 2.910384760276315
Epoch 80 loss: 2.7358225563535927
Epoch 90 loss: 2.5834294307677803

We can see that the loss continues to decrease every 10th iteration, but we still don't know how well our network does at predicting the data.

Network performance

To answer this question, we need to be able to make predictions from our network, i.e. $y(\mathbf{x}, \mathbf{w}_{\text{network}})$, where $\mathbf{w}_{\text{network}}$ are the weights we obtained by training our network. Define a function feed_forward() below to automate the process of forward propagation so we can easily make predictions from our network.

In [11]:
def feed_forward(X, weights):
    A1 = np.matmul(weights['W1'], X) + weights['b1']
    Z1 = sigmoid(A1)
    A2 = np.matmul(weights['W2'], Z1) + weights['b2']
    Z2 = np.apply_along_axis(softmax, 0, A2)
    return(Z2)
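A quick shape check (given the split above, X_test has 60,000 columns after transposing):

print(feed_forward(X_test, nn).shape)  # (10, 60000): one column of probabilities per observation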

Now use feed_forward() to calculate $y(\mathbf{x}, \mathbf{w}_{\text{network}})$ and convert our 1-of-$K$ matrix y_test back to a categorical vector for comparison to our predictions.

In [12]:
predictions = np.argmax(feed_forward(X_test, nn), axis = 0)
labels = np.argmax(y_test, axis=1)

Once we've done this, we can use confusion_matrix() and classification_report() from sklearn.metrics to compute a confusion matrix and produce a classification report for each digit.

In [13]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(predictions, labels))
print(classification_report(predictions, labels))
[[2595  150  666  252  226  243  270  996  200  486]
 [ 150 3875  297  615  412  159  288  344  232  849]
 [ 548  241 1560 1163  248  692  422  217  861  278]
 [ 527  801 1098 1257  159  344  195  398  787  174]
 [ 169  227  307   76 2113  245  365  426  590 1365]
 [ 308  195  461  451  252 1300  216  439  677  586]
 [ 382  148  509  481  442  394 3017   96  313  116]
 [ 652  174  275   64  795  675  364 2199  431  806]
 [ 483  744  400  938  289  625  213  238 1419  301]
 [  84  192  382  800  931  718  558  889  383 1037]]
              precision    recall  f1-score   support

           0       0.44      0.43      0.43      6084
           1       0.57      0.54      0.55      7221
           2       0.26      0.25      0.26      6230
           3       0.21      0.22      0.21      5740
           4       0.36      0.36      0.36      5883
           5       0.24      0.27      0.25      4885
           6       0.51      0.51      0.51      5898
           7       0.35      0.34      0.35      6435
           8       0.24      0.25      0.25      5650
           9       0.17      0.17      0.17      5974

   micro avg       0.34      0.34      0.34     60000
   macro avg       0.34      0.33      0.33     60000
weighted avg       0.34      0.34      0.34     60000

Speeding things up

We're not doing fantastically here, and our loss decreased pretty slowly. One of the easiest ways to speed up any gradient descent algorithm is to implement stochastic gradient descent, which randomly cycles through the observations in the data and updates the weights based on the results from one observation at a time. We're going to implement a compromise between stochastic and batch gradient descent known as mini-batch stochastic gradient descent, where in each epoch we shuffle the data, split it into $b$ batches, and then evaluate the network and update the weights one batch at a time.
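Before embedding this in the full training function, here's the shuffle-then-slice pattern in isolation on a toy array (the sizes here are made up):

idx = np.random.permutation(6)      # shuffled observation indices
toy = np.arange(60).reshape(10, 6)  # 10 features x 6 observations
toy_shuffled = toy[:, idx]          # reorder the columns (observations)
for i in range(3):                  # 3 batches of 2 observations each
    batch = toy_shuffled[:, i * 2:(i + 1) * 2]
    print(batch.shape)              # (10, 2)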

Complete the code below to define the function network_sgd(), which implements mini-batch stochastic gradient descent. Note that in the interest of clarity, this code is not robust: if the number of observations is not evenly divisible by the number of batches, the leftover observations will be silently dropped.

In [14]:
def network_sgd(X, y, epochs, batches, eta):
    
    # calculate batch size
    batch_size = int(X.shape[1] / batches)
    
    # intialize weights and biases
    W1 = np.random.randn(64, X.shape[0])
    b1 = np.zeros((64, 1))
    W2 = np.random.randn(digits, 64)
    b2 = np.zeros((digits, 1))
    
    # iterate through epochs
    for e in range(epochs):
        
        # shuffle observations
        permutation = np.random.permutation(X.shape[1])
        X_shuffled = X[:, permutation]
        y_shuffled = y[permutation, :]
        
        # iterate through batches
        for i in range(batches):
            
            # subset batch i
            begin = i * batch_size
            end = min(begin + batch_size, X.shape[1])
            X_batch = X_shuffled[:, begin:end]
            y_batch = y_shuffled[begin:end, :]
            
            # calculate feed forward predictions
            A1 = np.matmul(W1, X_batch) + b1
            Z1 = sigmoid(A1)
            A2 = np.matmul(W2, Z1) + b2
            Z2 = np.apply_along_axis(softmax, 0, A2)
            
            # calculate derivatives for second weight layer
            dA2 = Z2 - y_batch.T
            dW2 = np.matmul(dA2, Z1.T) / X_batch.shape[1]
            db2 = np.sum(dA2, axis=1, keepdims=True) / X_batch.shape[1]
            
            # calculate derivatives for first weight layer
            dZ1 = np.matmul(W2.T, dA2)
            dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
            dW1 = np.matmul(dA1, X_batch.T) / X_batch.shape[1]
            db1 = np.sum(dA1, axis=1, keepdims=True) / X_batch.shape[1]
            
            # gradient update to weights
            W2 -= eta * dW2
            b2 -= eta * db2
            W1 -= eta * dW1
            b1 -= eta * db1
        
        # print loss for entire dataset every 10th epoch
        if (e % 10 == 0):
            y_pred = feed_forward(X, {'W1':W1, 'b1':b1,
                                      'W2':W2, 'b2':b2}).T
            print('Epoch', e, 'loss:', loss(y, y_pred))
    
    # return weights in dictionary
    return({'W1':W1, 'b1':b1, 'W2':W2, 'b2':b2})

Now let's train our network with batch stochastic gradient descent and check out the improvements.

In [15]:
nn_sgd = network_sgd(X_train, y_train, epochs=100, batches = 10, eta=.1)

# calculate predictions
predictions_sgd = np.argmax(feed_forward(X_test, nn_sgd), axis = 0)

print(confusion_matrix(predictions_sgd, labels))
print(classification_report(predictions_sgd, labels))
Epoch 0 loss: 5.443446758876805
Epoch 10 loss: 2.3624232272524974
Epoch 20 loss: 1.6061814429586339
Epoch 30 loss: 1.2658504330687346
Epoch 40 loss: 1.076473857112683
Epoch 50 loss: 0.9547409361174736
Epoch 60 loss: 0.8685446749005787
Epoch 70 loss: 0.8035998770588519
Epoch 80 loss: 0.7523935513761765
Epoch 90 loss: 0.7106095413808385
[[5103    1  119   79   35  228  122   53   88   96]
 [   2 6318   92   62   27  124   33  110  210   32]
 [  63   97 4203  363  115  137  345  102  300   72]
 [  75   97  289 4380   41  579   64  105  417  149]
 [  30   16  177   55 4536  206  215  169  105  571]
 [ 323   72  119  401   99 3213  228   65  473  140]
 [ 165   27  347   84  176  182 4812   13   83   42]
 [  32   17  136  119  120  160   12 4893  110  663]
 [  85   89  396  406  110  401   71  101 3857  137]
 [  20   13   77  148  608  165    6  631  250 4096]]
              precision    recall  f1-score   support

           0       0.87      0.86      0.86      5924
           1       0.94      0.90      0.92      7010
           2       0.71      0.73      0.72      5797
           3       0.72      0.71      0.71      6196
           4       0.77      0.75      0.76      6080
           5       0.60      0.63      0.61      5133
           6       0.81      0.81      0.81      5931
           7       0.78      0.78      0.78      6262
           8       0.65      0.68      0.67      5653
           9       0.68      0.68      0.68      6014

   micro avg       0.76      0.76      0.76     60000
   macro avg       0.75      0.75      0.75     60000
weighted avg       0.76      0.76      0.76     60000

Each epoch takes a little longer than in batch gradient descent. numpy's matrix multiplication is fast, so it's more efficient to multiply a matrix of all 10,000 training observations by our first layer of weights once than it is to multiply a matrix of 1,000 observations by our weights 10 times in a loop.

However, we end up with a much smaller loss after 100 epochs than we did with batch gradient descent, and our classification report looks much better. Fill in the code below to calculate the increase in correctly predicted digits.

In [16]:
np.diag(confusion_matrix(predictions_sgd, labels)).sum() - np.diag(confusion_matrix(predictions, labels)).sum()
Out[16]:
25039
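We can also express each network's overall accuracy directly from the same objects (a convenience check; the exact values will vary with the random weight initialization):

print((predictions == labels).mean())      # batch gradient descent
print((predictions_sgd == labels).mean())  # mini-batch SGD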

Don't overdo it

The flexibility of neural networks can also be a curse. Using only $1/7$ of the total data, we're able to achieve $> 50\%$ precision on all digits with only 100 training epochs. But due to their ability to approximate any continuous function, neural networks can very easily overfit noise in our training data and do poorly on out of sample prediction. The network we've been using so far is pretty good at avoiding this behavior, but let's break a few things to illustrate the problem. Increase the number of nodes in the hidden layer.

Let's also calculate the validation loss every 10th epoch in addition to our training loss, and then store both along with the epoch number. The training loss generally decreases with each epoch, but the validation loss often begins to increase after an initial decrease.

In [17]:
def network_val(X_train, X_test, y_train, y_test, epochs, batches, eta):
    
    # create empty lists to hold training and validation loss
    train_loss = list()
    test_loss = list()
    
    # calculate batch size from data
    batch_size = int(X_train.shape[1] / batches)
    
    # initialize weights and biases
    W1 = np.random.randn(64, X_train.shape[0])
    b1 = np.zeros((64, 1))
    W2 = np.random.randn(digits, 64)
    b2 = np.zeros((digits, 1))
    
    # iterate through epochs
    for e in range(epochs):
        
        # shuffle observations
        permutation = np.random.permutation(X_train.shape[1])
        X_shuffled = X_train[:, permutation]
        y_shuffled = y_train[permutation, :]
        
        # iterate through batches
        for i in range(batches):
            
            # subset batch i
            begin = i * batch_size
            end = min(begin + batch_size, X_shuffled.shape[1])
            X_batch = X_shuffled[:, begin:end]
            y_batch = y_shuffled[begin:end, :]
            
            # calculate feed forward predictions
            A1 = np.matmul(W1, X_batch) + b1
            Z1 = sigmoid(A1)
            A2 = np.matmul(W2, Z1) + b2
            Z2 = np.apply_along_axis(softmax, 0, A2)
            
            # calculate derivatives for second weight layer
            dA2 = Z2 - y_batch.T
            dW2 = np.matmul(dA2, Z1.T) / X_batch.shape[1]
            db2 = np.sum(dA2, axis=1, keepdims=True) / X_batch.shape[1]
            
            # calculate derivatives for first weight layer
            dZ1 = np.matmul(W2.T, dA2)
            dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
            dW1 = np.matmul(dA1, X_batch.T) / X_batch.shape[1]
            db1 = np.sum(dA1, axis=1, keepdims=True) / X_batch.shape[1]
            
            # gradient update to weights
            W2 -= eta * dW2
            b2 -= eta * db2
            W1 -= eta * dW1
            b1 -= eta * db1

        # record training and validation error every 10th epoch
        if (e % 10 == 0):
            
            # calculate predictions on training and validation data
            y_pred_train = feed_forward(X_train, {'W1':W1, 'b1':b1,
                                               'W2':W2, 'b2':b2})
            y_pred_test = feed_forward(X_test, {'W1':W1, 'b1':b1,
                                                'W2':W2, 'b2':b2})
            
            # append training and validation loss to list
            train_loss.append(loss(y_train, y_pred_train.T))
            test_loss.append(loss(y_test, y_pred_test.T))
    
    # return an array of epoch, training loss, and validation loss
    return(np.column_stack([np.arange(0, epochs, 10),
                            np.column_stack([np.array(train_loss),
                                             np.array(test_loss)])]))

To really illustrate this behavior, significantly increase the learning rate (remembering that $\eta \in (0,1]$), and train this network for 500 epochs.

In [18]:
tv = network_val(X_train, X_test, y_train, y_test, epochs=500, batches = 10, eta=.99)
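The returned array has one row per recorded epoch (50 rows for 500 epochs recorded every 10th) and columns for epoch, training loss, and validation loss:

print(tv.shape)  # (50, 3)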

Finally, let's plot the training and validation loss as a function of epoch.

In [19]:
# plot training error
plt.plot(tv[:,0], tv[:,1])

# plot validation error
plt.plot(tv[:,0], tv[:,2])

# add legend
plt.legend(['training error', 'validation error'], loc='upper right')

# display plot
plt.show()

We can use this property of the loss to stop training our network when the validation loss begins to increase. If we do so, we can avoid overfitting our network to the training data.
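As a rough sketch of this idea using the tv array from above (a simple heuristic, not a full early stopping implementation):

# row with the lowest recorded validation loss
best = tv[np.argmin(tv[:, 2])]
print('stop around epoch', int(best[0]), 'with validation loss', best[2])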

  1. This notebook is adapted from these tutorials.